Load the necessary libraries
library(mgcv) #for GAMs
library(broom) #for tidy results
library(gratia) #for GAM plots
library(DHARMa) #for simulated residual diagnostics
library(performance) #for model diagnostics
library(see) #for plotting residuals
library(emmeans) #for marginal means etc
library(MuMIn) #for model selection and AICc
library(tidyverse) #for data wrangling
This is an entirely fabricated example (how embarrassing). So here is a picture of some red grouse chicks to compensate.
Red grouse chicks
Format of data.gp.csv data file
| x | y |
|---|---|
| 2 | 3 |
| 4 | 5 |
| 8 | 6 |
| 10 | 7 |
| 14 | 4 |
| x | - a continuous predictor |
| y | - a continuous response |
data_gam = read_csv('../public/data/data_gam.csv', trim_ws=TRUE)
##
## ── Column specification ────────────────────────────────────────────
## cols(
## x = col_double(),
## y = col_double()
## )
glimpse(data_gam)
## Rows: 5
## Columns: 2
## $ x <dbl> 2, 4, 8, 10, 14
## $ y <dbl> 3, 5, 6, 7, 4
Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\beta_0 + f(x_i)\\ f(x_i) = \sum^k_{j=1}{b_j(x_i)\beta_j} \]
where \(\beta_0\) is the y-intercept, and \(f(x)\) indicates an additive smoothing function of \(x\).
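To make the formula concrete: the smooth \(f(x)\) really is just a weighted sum of basis functions \(b_j(x)\). As a purely illustrative sketch (with no smoothing penalty applied), we can construct the basis with smoothCon() and recover the \(\beta_j\) weights via an ordinary linear model on those columns:

```r
library(mgcv)
# Sketch only: build the basis matrix X, whose columns are the b_j(x_i),
# then fit an *unpenalised* linear model on them. A real GAM additionally
# penalises the wiggliness of the beta_j.
sm <- smoothCon(s(x, k = 3), data = data_gam)[[1]]
X  <- sm$X                      # 5 x 3 matrix of basis function evaluations
coef(lm(data_gam$y ~ X - 1))    # the beta_j weights for this unpenalised fit
```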
Although this is a fictitious example without a clear backstory, given that there are two continuous variables (and that one of these has been identified as a response and the other as a predictor), we can assume that we are interested in investigating the relationship between the two. As such, our typical starting point is to explore the basic trend between the two using a scatterplot.
ggplot(data_gam, aes(y=y, x=x))+
geom_point()+
geom_line()
This does not look like a particularly linear relationship. Let's fit a loess smoother.
ggplot(data_gam, aes(y=y, x=x))+
geom_point()+
geom_smooth()
And what would a linear smoother look like?
ggplot(data_gam, aes(y=y, x=x))+
geom_point()+
geom_smooth(method='lm')
Rather than either a loess or linear smoother, we can also try a Generalized Additive Model (GAM) smoother. Don't pay too much attention to the GAM formula at this stage; it will be discussed later in the Model Fitting section.
ggplot(data_gam, aes(y=y, x=x))+
geom_point()+
geom_smooth(method='gam', formula=y~s(x,k=3))
Conclusions:
Prior to fitting the GAM, it is worth gaining a bit of an understanding of what will occur behind the scenes.
Let's say we intended to fit a smoother with three knots: one at each end of the trend and one in the middle. We could re-express our predictor (x) as three dummy variables that collectively reflect a spline (in this case, potentially two joined polynomials).
data.frame(smoothCon(s(x, k=3), data=data_gam)[[1]]$X) %>%
bind_cols(data_gam)
And we could visualize these dummies as three separate components.
basis(s(x, k=3), data=data_gam) %>% draw
basis(s(x, k=3, bs='cr'), data=data_gam) %>% draw
Or, plotting the basis columns manually:
newdata <-
data.frame(smoothCon(s(x, k=3), data=data_gam)[[1]]$X) %>%
bind_cols(data_gam)
ggplot(newdata, aes(x=x)) +
geom_line(aes(y=X1)) +
geom_line(aes(y=X2)) +
geom_line(aes(y=X3))
So with that in mind, let's fit the GAM.
data_gam.gam <- gam(y~s(x), data = data_gam, method = 'REML')
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): A term has fewer unique covariate combinations than specified maximum degrees of freedom
Conclusions:
We will optimise the degree of smoothing via REML.
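To see what this choice does, we can refit the same model under a different smoothing-parameter selection criterion and compare the estimated smoothing parameters (a sketch; GCV.Cp is mgcv's GCV/Mallows' Cp alternative to REML):

```r
library(mgcv)
# Same model, two smoothing-parameter selection criteria.
m_reml <- gam(y ~ s(x, k = 3), data = data_gam, method = 'REML')
m_gcv  <- gam(y ~ s(x, k = 3), data = data_gam, method = 'GCV.Cp')
# The fitted smoothing parameter(s) under each criterion:
c(REML = unname(m_reml$sp), GCV = unname(m_gcv$sp))
```

REML tends to be less prone to undersmoothing than GCV, which is one reason it is the generally recommended default.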
data_gam.gam <- gam(y~s(x, k = 3), data=data_gam, method = 'REML')
#data_gam.gam <- gam(y~s(x, k=3, bs='cr'), data=data_gam, method='REML')
#data_gam.gam <- gam(y~s(x, k=3, bs='ps'), data=data_gam, method='REML')
gam.check(data_gam.gam, pch = 19)
##
## Method: REML Optimizer: outer newton
## full convergence after 5 iterations.
## Gradient range [-4.151024e-06,2.176788e-06]
## (score 6.024247 & scale 0.4120262).
## Hessian positive definite, eigenvalue range [0.2676093,1.6827].
## Model rank = 3 / 3
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(x) 2.00 1.95 2.19 0.98
Conclusions:
The k.check() function provides a more targeted summary of the gam.check() output.
k.check(data_gam.gam)
## k' edf k-index p-value
## s(x) 2 1.949003 2.186796 0.9925
Conclusions:
The appraise() function provides the graphical diagnostics in ggplot.
appraise(data_gam.gam)
Conclusions:
performance::check_model(data_gam.gam)
## Homogeneity of variance could not be computed. Cannot extract residual variance from objects of class 'gam'.
Conclusions:
The check_distribution() function attempts to provide some guidance as to which distributions are likely to be useful for modelling the response. Note, this is very much experimental and is not going to be all that useful for such a small sample size.
performance::check_distribution(data_gam.gam)
Conclusions:
resids <- DHARMa::simulateResiduals(data_gam.gam, plot=TRUE)
Conclusions:
Concurvity is the non-linear equivalent of (multi)collinearity. It measures how well each smooth could be approximated by either a combination of the other smooths or each individual other smooth. Three metrics are generated, all of which are bounded between 0 (no issue) and 1 (total lack of identifiability). Essentially, we can decompose a smooth term into the component that lies within the space (g) of the other terms and the component that is unique to the term's own space (f). The three metrics are all variants on the ratio of g to f.
concurvity(data_gam.gam)
## para s(x)
## worst 3.194887e-30 3.487872e-30
## observed 3.194887e-30 3.148783e-30
## estimate 3.194887e-30 2.393887e-31
concurvity(data_gam.gam, full=FALSE)
## $worst
## para s(x)
## para 1.000000e+00 3.487872e-30
## s(x) 3.194887e-30 1.000000e+00
##
## $observed
## para s(x)
## para 1.000000e+00 3.148783e-30
## s(x) 3.194887e-30 1.000000e+00
##
## $estimate
## para s(x)
## para 1.000000e+00 2.393887e-31
## s(x) 3.194887e-30 1.000000e+00
Conclusions:
There are fewer options for plotting partial plots from a GAM. Note, the plots are passed on to cowplot for multiple figures and thus it is not possible to use + to add further ggplot instructions.
draw(data_gam.gam)
## or with the 'partial residuals' (quasi observations) included
draw(data_gam.gam, residuals=TRUE)
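If gratia is unavailable, base mgcv offers a comparable partial plot via plot.gam(), with the partial residuals overlaid:

```r
library(mgcv)
# Base-graphics equivalent of gratia::draw(..., residuals = TRUE):
# the fitted smooth with a shaded confidence band and partial residuals.
plot(data_gam.gam, residuals = TRUE, pch = 19, shade = TRUE)
```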
summary(data_gam.gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x, k = 3)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0000 0.2871 17.42 0.00293 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x) 1.949 1.997 10.88 0.0861 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.835 Deviance explained = 91.5%
## -REML = 6.0242 Scale est. = 0.41203 n = 5
Conclusions:
Ref.df are of no real use anymore. The F statistic and p-value test the null hypothesis that the trend is a straight, flat line. In this case, that null hypothesis is not rejected (due to a lack of power). It could have been rejected either because the trend is wiggly, or because it is linear but with a slope different from zero.
tidy(data_gam.gam)
AIC(data_gam.gam)
## [1] 13.29528
AICc(data_gam.gam)
## [1] 15375.87
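The enormous AICc is not a bug. Assuming the standard small-sample correction AICc = AIC + 2k(k+1)/(n-k-1), where k is the effective number of parameters, we can reproduce the effect by hand:

```r
# With n = 5 observations and k (effective parameters, including the scale
# term) close to 4, the denominator n - k - 1 is nearly zero, so the
# correction term, and hence AICc, blows up relative to AIC.
k <- attr(logLik(data_gam.gam), 'df')
n <- nobs(data_gam.gam)
AIC(data_gam.gam) + 2 * k * (k + 1) / (n - k - 1)
```

The lesson is simply that information criteria are not trustworthy at this sample size.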
For the purpose of further illustration, let's proceed as if the above analyses had suggested a significantly wiggly relationship. If this was the case, in addition to a qualitative description of the trend (for example, that the response initially increases to a peak before declining again), we might want to provide estimates of where along the trend the peak occurs, where the slope is steepest, and where the trend changes most abruptly.
If we had a formula for the curve, we could simply take the first-order derivatives to find the slope (addressing the first two of the above) and the second-order derivatives (to address the third). Unfortunately, we do not have such an equation. However, it is possible to estimate these derivatives via finite differences. This is a technique in which predictions are made for a large sequence of values of a predictor (hence a very small interval between predictions), and the instantaneous slopes are calculated between successive pairs of predictions.
derivatives(data_gam.gam, order=1) %>% draw
## Find the approximate peak
d = derivatives(data_gam.gam, order=1)
d
d %>% summarise(Value=data[which.min(abs(derivative))],
lower=data[which.min(abs(lower))],
upper=data[which.min(abs(upper))])
Conclusions:
If we wanted to find the value of the predictor that corresponded to the steepest positive slope:
derivatives(data_gam.gam, order=1) %>% draw
## Find the approximate peak
d = derivatives(data_gam.gam, order=1)
d
d %>% summarise(
maxDer = max(derivative),
Value=data[which.max(derivative)],
lower=data[which.min(abs(maxDer-lower))],
upper=data[which.min(abs(maxDer-upper))])
Conclusions:
derivatives(data_gam.gam, order=2) %>% draw
## Find where the trend changes most abruptly
d = derivatives(data_gam.gam, order=2)
d
d %>% summarise(Value=data[which.max(abs(derivative))],
lower=data[which.max(abs(lower))],
upper=data[which.max(abs(upper))])
Conclusions:
Although it is possible to obtain partial plots using a variety of methods (see above), all of these produce partial plots that are centered on the y-axis. Whilst this is technically correct (the GAM partial plot is concerned with describing the shape of the trend rather than the absolute values along it, and it is not burdened with having to account for the impacts of other predictors on those values), it does make the plot less useful as a visual representation of the nature of the relationship between the response and predictor(s).
So an alternative way to produce a useful summary figure is to base the figure on predictions (using emmeans() for convenience).
data_gam.list <- with(data_gam, list(x = modelr::seq_range(x, n = 100)))
newdata = emmeans(data_gam.gam, ~x, at = data_gam.list) %>%
as.data.frame
head(newdata)
ggplot(newdata, aes(y=emmean, x=x)) +
geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', alpha=0.3) +
geom_line() +
geom_point(data=data_gam, aes(y=y,x=x))+
theme_bw()
newdata <- newdata %>% mutate(Peak = between(emmean, 7.91, 9.48))
ggplot(newdata, aes(y=emmean, x=x)) +
geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', alpha=0.3) +
geom_line(aes(color=Peak)) +
geom_point(data=data_gam, aes(y=y,x=x))+
theme_bw()
## Partials
#resid.list = list(x=data_gam$x)
newdata.partial = data_gam %>%
mutate(Pred = predict(data_gam.gam, type='link'),
Res = resid(data_gam.gam),
Resid = Pred + Res)
#newdata.partial = emmeans(data_gam.gam, ~x, at=resid.list) %>%
# as.data.frame %>%
# mutate(Resid = emmean + resid(data_gam.gam))
ggplot(newdata, aes(y=emmean, x=x)) +
geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', alpha=0.3) +
geom_line() +
geom_point(data=newdata.partial, aes(y=Resid,x=x))+
theme_bw()